
    Automated text simplification as a preprocessing step for machine translation into an under-resourced language

    In this work, we investigate the possibility of using a fully automatic text simplification system on the English source in machine translation (MT) to improve its translation into an under-resourced language. We use a state-of-the-art automatic text simplification (ATS) system for lexically and syntactically simplifying source sentences, which are then translated with two state-of-the-art English-to-Serbian MT systems: phrase-based MT (PBMT) and neural MT (NMT). We explore three different scenarios for using the ATS in MT: (1) using the raw output of the ATS; (2) automatically filtering out the sentences with low grammaticality and meaning-preservation scores; and (3) performing a minimal manual correction of the ATS output. Our results show an improvement in the fluency of the translation regardless of the chosen scenario, and differences in the success of the three scenarios, depending on the MT approach used (PBMT or NMT), with regard to improving translation fluency and post-editing effort.
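
    As a sketch of the three scenarios, consider the minimal Python pipeline below. All of the callables (simplify, grammaticality, meaning_preservation, correct) are hypothetical stand-ins for the ATS system, the automatic quality estimators and the manual correction step; the 0.5 threshold is likewise an assumption, not a value from the paper.

        # Minimal sketch of the three ATS-in-MT scenarios (hypothetical names).
        def prepare_source(sentences, simplify, grammaticality,
                           meaning_preservation, correct,
                           scenario="raw", threshold=0.5):
            """Return the source sentences to feed into the MT system."""
            prepared = []
            for sentence in sentences:
                simplified = simplify(sentence)
                if scenario == "raw":                 # (1) raw ATS output
                    prepared.append(simplified)
                elif scenario == "filtered":          # (2) drop low-scoring output
                    if (grammaticality(simplified) >= threshold and
                            meaning_preservation(simplified) >= threshold):
                        prepared.append(simplified)
                elif scenario == "corrected":         # (3) minimal manual correction
                    prepared.append(correct(simplified))
            return prepared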

    Traducción de frases del español ‘original’ al español ‘simplificado’ (Translation of sentences from ‘original’ Spanish into ‘simplified’ Spanish)

    Text Simplification (TS) aims to convert complex sentences into simpler variants that are more accessible to wider audiences. Several recent studies have addressed this problem as a monolingual machine translation (MT) problem (translating from ‘original’ to ‘simplified’ language instead of from one language into another) using the standard phrase-based statistical machine translation (PB-SMT) model. We investigate whether the same approach would be equally successful regardless of the type of simplification we wish to learn (given that different target audiences require different levels of simplification). Our preliminary results indicate that the standard PB-SMT model might not be able to learn the strong simplifications needed for certain users, e.g. people with Down's syndrome. Additionally, we show that the phrase tables obtained during the translation process seem to be able to capture some adequate lexical simplifications.
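
    The last observation suggests a simple way to mine lexical simplifications from the learned model. Below is a minimal sketch, assuming a Moses-style phrase table (fields separated by " ||| ") and a word-frequency dictionary used as a crude simplicity proxy; neither the file layout beyond the Moses convention nor the frequency criterion comes from the paper.

        # Mine candidate word-level simplifications from a phrase table.
        def lexical_simplifications(phrase_table_path, word_freq):
            """Yield (complex, simple) pairs: single-word entries whose
            target side is a more frequent, hence presumably simpler, word."""
            with open(phrase_table_path, encoding="utf-8") as table:
                for line in table:
                    parts = line.split(" ||| ")
                    if len(parts) < 2:
                        continue
                    source, target = parts[0].strip(), parts[1].strip()
                    if " " in source or " " in target or source == target:
                        continue  # keep only word-to-word entries
                    if word_freq.get(target, 0) > word_freq.get(source, 0):
                        yield source, target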

    Improving machine translation of English relative clauses with automatic text simplification

    This article explores the use of automatic sentence simplification as a preprocessing step in neural machine translation of English relative clauses into grammatically complex languages. Our experiments on English-to-Serbian and English-to-German translation show that this approach can reduce the technical post-editing effort (number of post-edit operations) needed to obtain a correct translation. We find that larger improvements can be achieved for more complex target languages, as well as for MT systems with lower overall performance. The improvements mainly originate from correctly simplified sentences with relatively complex structure, while simpler structures are already translated sufficiently well using the original source sentences.
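
    The effort measure above counts post-edit operations; a common proxy for this (not necessarily the exact metric used in the article) is word-level edit distance between the MT output and its post-edited version, as in the following sketch.

        # Word-level edit distance: insertions, deletions and substitutions
        # needed to turn the MT output into its post-edited version.
        def edit_operations(mt_output: str, post_edited: str) -> int:
            hyp, ref = mt_output.split(), post_edited.split()
            dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
            for i in range(len(hyp) + 1):
                dist[i][0] = i                        # delete all hypothesis words
            for j in range(len(ref) + 1):
                dist[0][j] = j                        # insert all reference words
            for i in range(1, len(hyp) + 1):
                for j in range(1, len(ref) + 1):
                    cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                    dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                     dist[i][j - 1] + 1,         # insertion
                                     dist[i - 1][j - 1] + cost)  # substitution
            return dist[len(hyp)][len(ref)]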

    New Data-Driven Approaches to Text Simplification

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

    Many texts we encounter in our everyday lives are lexically and syntactically very complex. This makes them difficult to understand for people with intellectual or reading impairments, and difficult for various natural language processing systems to process. This motivated the need for text simplification (TS), which transforms texts into simpler variants. Given that this is still a relatively new research area, many challenges remain. The focus of this thesis is on better understanding the current problems in automatic text simplification (ATS) and proposing new data-driven approaches to solving them.

    We propose methods for learning sentence splitting and deletion decisions, built upon parallel corpora of original and manually simplified Spanish texts, which outperform the existing similar systems. Our experiments in adapting those methods to different text genres and target populations report promising results, thus offering one possible solution for dealing with the scarcity of parallel corpora for text simplification aimed at specific target populations, which is currently one of the main issues in ATS. The results of our extensive analysis of the phrase-based statistical machine translation (PB-SMT) approach to ATS reject the widespread assumption that the success of that approach largely depends on the size of the training and development datasets. They indicate more influential factors for the success of the PB-SMT approach to ATS, and reveal some important differences between cross-lingual MT and the monolingual MT used in ATS. Our event-based system for simplifying news stories in English (EventSimplify) overcomes some of the main problems in ATS: it does not require a large number of handcrafted simplification rules nor parallel data, and it performs significant content reduction. The automatic and human evaluations conducted show that it produces grammatical text and increases readability, preserving and simplifying relevant content and reducing irrelevant content.

    Finally, this thesis addresses another important issue in TS: how to automatically evaluate the performance of TS systems, given that access to the target users might be difficult. Our experiments indicate that existing readability metrics can successfully be used for this task when enriched with human evaluation of grammaticality and preservation of meaning.
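
    The evaluation idea in the last paragraph can be made concrete with a small sketch: score readability with a standard metric, but only for outputs that human raters judged grammatical and meaning-preserving. The Flesch Reading Ease formula is standard; the 1-5 rating scale, the gating threshold and the naive syllable counter are assumptions for illustration.

        import re

        def syllables(word):
            """Very rough syllable count: number of vowel groups (heuristic)."""
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_reading_ease(text):
            words = text.split()
            n_words = max(1, len(words))
            n_sents = max(1, len(re.findall(r"[.!?]+", text)))
            n_syll = sum(syllables(w) for w in words)
            return 206.835 - 1.015 * n_words / n_sents - 84.6 * n_syll / n_words

        def ts_score(output_text, grammaticality, meaning, min_rating=3):
            """Readability only counts if human raters found the output
            grammatical and meaning-preserving (1-5 scale assumed)."""
            if grammaticality < min_rating or meaning < min_rating:
                return None  # reject ungrammatical / meaning-distorting output
            return flesch_reading_ease(output_text)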

    CoCo: A tool for automatically assessing conceptual complexity of texts

    Traditional text complexity assessment usually takes into account only syntactic and lexical text complexity. The task of automatically assessing conceptual text complexity, important for maintaining readers' interest and for adapting texts for struggling readers, has only recently been proposed. In this paper, we present CoCo, a tool for automatic assessment of conceptual text complexity based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe the code and the possibilities for its personalization and adaptation in detail. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.
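
    A purely hypothetical usage sketch follows; the function names below are placeholders, not CoCo's actual API. It also illustrates why the entity linker matters: every downstream measure is computed over whatever set of concepts the linker returns.

        # Hypothetical driver for a CoCo-like tool (placeholder names).
        def conceptual_complexity(text, link_entities, concept_score):
            """Average a knowledge-graph-based complexity score over the
            concepts found in the text by the chosen entity linker."""
            concepts = link_entities(text)   # the linker choice changes this set
            if not concepts:
                return 0.0
            return sum(concept_score(c) for c in concepts) / len(concepts)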

    Automatic assessment of conceptual text complexity using knowledge graphs

    The complexity of texts is usually assessed only at the lexical and syntactic levels. Although it is known that conceptual complexity plays a significant role in text understanding, no attempts have been made to assess it automatically. We propose to automatically estimate the conceptual complexity of texts by exploiting a number of graph-based measures over a large knowledge base. Using a high-quality English corpus for language learners, we show that graph-based measures of individual text concepts, as well as of the way they relate to each other in the knowledge graph, have high discriminative power when distinguishing between two versions of the same text. Furthermore, when used as features in a binary classification task aiming to choose the simpler of two versions of the same text, our measures achieve high performance even in a default setup.
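
    As an illustration of the pairwise setup (not the paper's code), the sketch below scores each version by the mean degree of its concepts in the knowledge graph, one of many possible graph-based measures, and predicts the better-connected version, i.e. the one mentioning more common concepts, to be the simpler one; that decision rule is an assumption.

        def mean_degree(concepts, graph):
            """graph: dict mapping each concept to its set of neighbours."""
            degrees = [len(graph.get(c, ())) for c in concepts]
            return sum(degrees) / len(degrees) if degrees else 0.0

        def simpler_version(concepts_a, concepts_b, graph):
            """Binary choice between two versions of the same text."""
            if mean_degree(concepts_a, graph) >= mean_degree(concepts_b, graph):
                return "A"
            return "B"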

    Effects of lexical properties on viewing time per word in autistic and neurotypical readers

    Eye-tracking studies from the past few decades have shaped the way we think of word complexity and cognitive load: words that are long, rare and ambiguous are more difficult to read. However, online processing techniques have scarcely been applied to investigating the reading difficulties of people with autism and which vocabulary is challenging for them. We present parallel gaze data obtained from adult readers with autism and a control group of neurotypical readers, and show that the former required higher cognitive effort to comprehend the texts, as evidenced by three gaze-based measures. We divide all words into four classes based on their viewing times for both groups and investigate the relationship between longer viewing times and word length, word frequency, and four cognitively based measures (word concreteness, familiarity, age of acquisition and imageability).

    University of Wolverhampton and German Research Foundation (DFG).
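
    The four word classes could be built, for example, by binning viewing times at their quartiles, as in this sketch (the quartile split is an illustrative assumption; the study may bin differently).

        import statistics

        def viewing_time_classes(time_per_word):
            """Map each word to class 1-4 by viewing-time quartile.
            time_per_word: dict of word -> mean viewing time (needs >= 2 words)."""
            q1, q2, q3 = statistics.quantiles(time_per_word.values(), n=4)
            return {word: 1 if t <= q1 else 2 if t <= q2 else 3 if t <= q3 else 4
                    for word, t in time_per_word.items()}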

    A spreading activation framework for tracking conceptual complexity of texts

    We propose an unsupervised approach for assessing the conceptual complexity of texts, based on spreading activation. Using the DBpedia knowledge graph as a proxy for long-term memory, mentioned concepts become activated and trigger further activation as the text is sequentially traversed. Drawing inspiration from psycholinguistic theories of reading comprehension, we model memory processes such as semantic priming, sentence wrap-up, and forgetting. We show that our models capture various aspects of conceptual text complexity and significantly outperform the current state of the art.
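
    A minimal sketch of the core mechanism, assuming a plain adjacency-dict graph in place of DBpedia: each mentioned concept is activated, spreads part of its activation to its neighbours (semantic priming), and all activations decay over time (forgetting). The decay and spread rates are illustrative, and sentence wrap-up is omitted for brevity.

        def conceptual_costs(concept_sequence, graph, decay=0.8, spread=0.5):
            """Per-mention processing cost: high when a concept arrives
            'cold', low when earlier text has already primed it."""
            activation = {}
            costs = []
            for concept in concept_sequence:
                # a primed (already active) concept is cheap to process
                costs.append(1.0 - min(1.0, activation.get(concept, 0.0)))
                for node in activation:
                    activation[node] *= decay        # forgetting: global decay
                activation[concept] = 1.0            # activate the mention itself
                neighbours = graph.get(concept, ())
                for n in neighbours:                 # semantic priming of neighbours
                    activation[n] = activation.get(n, 0.0) + spread / len(neighbours)
            return costs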